For this portfolio piece, I want to take some data about the music I listened to in 2022 and turn it into a visual that is both nice-looking and informative.
To start, I’ll need to load some packages, as well as my Spotify data, which I requested from Spotify and downloaded as a JSON file. The dataset contains all of my Spotify listening data for the year 2022.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(jsonlite)
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:purrr':
##
## flatten
spotify <- "StreamingHistory0.JSON" %>%
  fromJSON() %>%
  as_tibble() %>%
  glimpse()
## Rows: 7,017
## Columns: 4
## $ endTime <chr> "2022-02-24 00:05", "2022-02-24 02:06", "2022-02-24 02:17",…
## $ artistName <chr> "Two Psychologists Four Beers", "Kamasi Washington", "Super…
## $ trackName <chr> "Episode 81: Against Retribution", "Clair de Lune", "Vive L…
## $ msPlayed <int> 2164222, 1270, 326, 667733, 828000, 963, 2152, 7530, 254053…
First, I’m only interested in music for this project, so I need to filter out podcasts from the dataset. Counting plays per artist makes the most-played podcasts easy to spot.
spotify %>%
  count(artistName) %>%
  arrange(desc(n))
## # A tibble: 1,009 × 2
## artistName n
## <chr> <int>
## 1 The Mountain Goats 710
## 2 Very Bad Wizards 526
## 3 Unknown Artist 145
## 4 The Beatles 140
## 5 Japanese Breakfast 117
## 6 Phoebe Bridgers 102
## 7 Wednesday 99
## 8 CAKE 94
## 9 Decoding the Gurus 89
## 10 St. Vincent 81
## # … with 999 more rows
# Drop podcasts (and the "Unknown Artist" placeholder) by name
spotify <- spotify %>%
  filter(!artistName %in% c(
    "Very Bad Wizards", "Unknown Artist", "Decoding the Gurus",
    "Taskmaster The Podcast", "Two Psychologists Four Beers",
    "Better Call Saul Insider Podcast",
    "Off Menu with Ed Gamble and James Acaster"
  ))
# Attach each artist's total play count to every row
spotify <- spotify %>%
  group_by(artistName) %>%
  mutate(plays = n())
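As an aside, the same per-artist tally can be attached without leaving the tibble grouped. This is only a minimal sketch on a made-up toy tibble (`toy` and its values are invented for illustration), using dplyr’s `add_count()`:

```r
library(dplyr)

# Toy stand-in for the streaming history (values invented for illustration)
toy <- tibble(
  artistName = c("A", "A", "B", "A", "C", "B"),
  msPlayed   = c(1000L, 2000L, 1500L, 500L, 3000L, 2500L)
)

# add_count() appends a per-artist row count and returns an ungrouped tibble
toy_counted <- toy %>%
  add_count(artistName, name = "plays")

toy_counted
```

Because the result is ungrouped, later steps don’t silently inherit the `artistName` grouping.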
My first thought was to plot listening trends over the year, to show which artists I was most interested in at different times and how those trends waxed and waned. Doing this with the whole dataset would be quite unwieldy, so to start with I kept only the artists with at least 40 plays, which leaves my top 21 artists of 2022.
spotify_top <- spotify %>%
  filter(plays >= 40)
spotify_top %>%
  count(artistName) %>%
  arrange(desc(n))
## # A tibble: 21 × 2
## # Groups: artistName [21]
## artistName n
## <chr> <int>
## 1 The Mountain Goats 710
## 2 The Beatles 140
## 3 Japanese Breakfast 117
## 4 Phoebe Bridgers 102
## 5 Wednesday 99
## 6 CAKE 94
## 7 St. Vincent 81
## 8 Sleater-Kinney 77
## 9 R.E.M. 70
## 10 AJJ 68
## # … with 11 more rows
I started by making a rainbow-colored scatter plot. It doesn’t really tell us anything at all, although it does, in its own way, chart the chronology of me listening to all these bands.
ggplot(spotify_top, aes(y = artistName, x = endTime, color = artistName)) +
  geom_point(position = "jitter") +
  theme(axis.text.y = element_blank())
To get usable dates for our purposes, I’ll use the lubridate package to parse the endTime strings, which are currently stored as text, and extract the month from each one.
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
What I ultimately want is a dataset that has each artist, the month of 2022, and the number of times each artist was played for a given month.
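# Parse the "YYYY-MM-DD HH:MM" strings and keep the month number (1-12)
spotify_top$month <- month(ymd_hm(spotify_top$endTime))
# Count plays per artist within each month
spotify_top <- spotify_top %>%
  group_by(month, artistName) %>%
  mutate(monthly_plays = n())
# Collapse to one row per month and artist (the mean of a constant is that constant)
spotify_monthly <- aggregate(monthly_plays ~ month + artistName, data = spotify_top, FUN = mean)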
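The month-by-artist table can also be built in a single pipeline. The following is a hedged sketch on invented toy data (`toy` is not the real dataset), using `count()` in place of the `mutate()` plus `aggregate()` combination:

```r
library(dplyr)
library(lubridate)

# Toy rows in the same "YYYY-MM-DD HH:MM" format as endTime (invented values)
toy <- tibble(
  endTime    = c("2022-02-24 00:05", "2022-02-25 02:06", "2022-03-01 10:00",
                 "2022-03-02 11:30", "2022-03-02 12:00"),
  artistName = c("A", "A", "A", "B", "B")
)

# Parse the timestamp, extract the month, then count rows per month and artist
toy_monthly <- toy %>%
  mutate(month = month(ymd_hm(endTime))) %>%
  count(month, artistName, name = "monthly_plays")

toy_monthly
```

Here `count()` returns one row per month and artist directly, so no averaging step is needed.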
Now that I have it, let’s try an initial plot! This is a decent starting place, although it is quite messy.
ggplot(spotify_monthly, aes(x = month, y = monthly_plays, color = artistName)) +
  geom_line()
Let’s see if we can improve upon our initial plot. To start with, there’s probably too much noise at the bottom from bands that aren’t getting many plays, so I winnowed it down to just the top 7 (the artists with more than 80 plays). Now, it’s not completely impossible to see the trend line for each artist, see how many listens each got month to month, and compare them to the others.
# plays > 80 keeps the top 7 artists
spotify_top7 <- spotify_top %>%
  filter(plays > 80)
spotify_monthly <- aggregate(monthly_plays ~ month + artistName + plays, data = spotify_top7, FUN = mean)
ggplot(spotify_monthly,
       aes(x = month, y = monthly_plays,
           color = fct_reorder(artistName, plays, .desc = TRUE))) +
  geom_line(linewidth = 1.3)
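The cutoff of 80 plays was chosen by eye. If the goal is simply "the top N artists", one possible alternative is to rank artists and keep the first N. This is a sketch on invented toy data, where `plays_log`, `top_artists`, and `top_plays` are hypothetical names:

```r
library(dplyr)

# Toy play log: one row per play (invented values)
plays_log <- tibble(artistName = c("A", "A", "A", "B", "B", "C"))

# Rank artists by number of plays and keep the top 2
top_artists <- plays_log %>%
  count(artistName, name = "plays") %>%
  slice_max(plays, n = 2, with_ties = FALSE)

# Keep only the plays that belong to those artists
top_plays <- plays_log %>%
  semi_join(top_artists, by = "artistName")
```

Setting `with_ties = FALSE` guarantees exactly N artists even when two artists have the same play count.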
Let’s make some more improvements. First of all, let’s change the x-axis to show the name of each month. Let’s also add some helpful labels and a title. I also added a new aesthetic to further differentiate the lines, and recolored them to try to make them stand out more against each other.
spotify_monthly <- spotify_monthly %>%
  mutate(month_name = month.name[month])
ggplot(spotify_monthly,
       aes(x = fct_reorder(month_name, month),
           y = monthly_plays,
           group = fct_reorder(artistName, plays, .desc = TRUE),
           color = fct_reorder(artistName, plays))) +
  geom_line(lineend = "round", aes(linewidth = fct_reorder(artistName, plays))) +
  scale_color_viridis_d() +
  theme_bw() +
  labs(title = "Listening Trends for Ben's top artists (2022)",
       x = "Month",
       y = "Plays",
       color = "Artist",
       linewidth = "Artist")
## Warning: Using linewidth for a discrete variable is not advised.
I wanted to make one more change: putting the legend inside the white space of the figure so that the plot itself can be wider. I also tried out a new color scheme, which I thought might make the colors stand out against each other better, but which also makes the lines look sort of like a bunch of wriggly worms.
spotify_monthly <- spotify_monthly %>%
  mutate(month_name = month.name[month])
ggplot(spotify_monthly,
       aes(x = fct_reorder(month_name, month),
           y = monthly_plays,
           group = fct_reorder(artistName, plays, .desc = TRUE),
           color = fct_reorder(artistName, plays))) +
  geom_line(lineend = "round",
            aes(linewidth = fct_reorder(artistName, plays))) +
  guides(color = guide_legend(reverse = TRUE), linewidth = guide_legend(reverse = TRUE)) +
  scale_color_viridis_d(option = "F", direction = 1) +
  theme_classic() +
  theme(
    legend.position = c(.95, .95),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6)) +
  labs(title = "Listening Trends for Ben's top artists (2022)",
       x = "Month",
       y = "Plays",
       color = "Ben's top artists",
       linewidth = "Ben's top artists")
## Warning: Using linewidth for a discrete variable is not advised.
At this point, I was getting a bit bored with this concept, and had a different idea for a visualization that I was more excited about. Now, let’s move on to that.
What I would like to do now is create a figure that plots the artists I listened to in 2022 against a map of the world, with each artist having a point on the graph corresponding to their hometown. I think this could be an interesting figure to look at, and would also be sort of an interesting look into where the music I heard “comes from”.
First, let’s get a larger sample of artists, and exclude a couple more podcasts from the mix.
artist_plays <- spotify %>%
  count(artistName) %>%
  filter(n > 11) %>%
  filter(!artistName %in% c("You're Wrong About", "My Brother, My Brother And Me"))
In between all these R chunks, I wrote a couple of R scripts to do the web scraping needed to get the data I’m after. All the data was scraped from Wikipedia.
First, I wrote a function that scrapes (a) the artist name, and (b) the “Origin” town from a given Wikipedia page’s infobox. Incidentally, because of the way the HTML nodes in these infoboxes work, it also scrapes a lot of other irrelevant information, which we’ll deal with later.
Second, I wrote a function that scrapes the latitude and longitude coordinates from a given location’s Wikipedia page.
Third, I wrote a script that (i) gets the Wikipedia page for each of my top artists that has one, using the glue function, (ii) makes a dataframe containing artists and the Wikipedia links for their hometowns (as well as a bunch of NAs and irrelevant information) using the scrape_town function, and (iii) makes a dataframe containing all of these hometowns matched to their coordinates using the scrape_coords function. Because only the actual hometown links scraped in step (ii) have coordinate information on their Wikipedia pages, they are the only ones that produce data in step (iii). Additionally, NAs and broken or non-scrapable links are filtered out in step (iii) so that the code is able to run.
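The scraping scripts themselves aren’t shown here, but the core idea of the first function can be sketched with rvest. To be clear, this is not the actual scrape_town function: `scrape_origin_links` is a hypothetical helper, the HTML is a made-up and heavily simplified stand-in for an infobox, and the XPath assumes the “Origin” label sits in a `th` cell next to the value, which real pages may not follow exactly.

```r
library(rvest)

# Made-up, simplified stand-in for an artist infobox (not real Wikipedia markup)
page <- minimal_html('
  <table class="infobox">
    <tr><th>Origin</th>
        <td><a href="/wiki/Example_Town">Example Town</a>, U.S.</td></tr>
    <tr><th>Genres</th>
        <td><a href="/wiki/Indie_folk">Indie folk</a></td></tr>
  </table>')

# Hypothetical helper: hrefs of the links in the infobox row labelled "Origin"
scrape_origin_links <- function(page) {
  page %>%
    html_elements(xpath = paste0(
      "//table[contains(@class, 'infobox')]",
      "//th[normalize-space() = 'Origin']",
      "/following-sibling::td//a"
    )) %>%
    html_attr("href")
}

origin_links <- scrape_origin_links(page)
```

On a live page, `read_html()` with the artist’s URL would take the place of `minimal_html()`. Targeting only the “Origin” row is one way to avoid picking up links from the rest of the infobox.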
Now we have 3 relevant dataframes: one with the number of plays per artist, one with each artist’s hometown, and one with the coordinates of each hometown. We want to join these into one so that we can plot artists on the map according to the coordinates of their hometown.
artist_towns <- read.csv("data/artist_towns.csv")
origin_coords <- read.csv("data/origin_coords.csv")
artist_map <- inner_join(artist_plays, artist_towns, by = "artistName")
## Warning in inner_join(artist_plays, artist_towns, by = "artistName"): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 2 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
artist_map <- inner_join(artist_map, origin_coords, by = "origin") %>%
  filter(!is.na(long))
## Warning in inner_join(artist_map, origin_coords, by = "origin"): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 11 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
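The warnings above suggest that some artists or towns appear more than once in the scraped tables. If those extra rows are exact duplicates, one way to handle them is to deduplicate before joining. This is only a sketch on invented toy tables (`toy_plays` and `toy_towns` are hypothetical), not a diagnosis of the real data:

```r
library(dplyr)

# Toy tables; artist "A" was scraped twice with the same origin link
toy_plays <- tibble(artistName = c("A", "B"), n = c(50L, 20L))
toy_towns <- tibble(
  artistName = c("A", "A", "B"),
  origin     = c("/wiki/X", "/wiki/X", "/wiki/Y")
)

# Keep one row per artist/origin pair, then join
toy_towns_unique <- toy_towns %>%
  distinct(artistName, origin)

toy_joined <- inner_join(toy_plays, toy_towns_unique, by = "artistName")
```

If an artist has several different origin links, `distinct()` keeps all of them, and those artists would still be plotted more than once.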
One last thing that has to happen is that the coordinates have to be converted into a usable form. They are currently expressed in degrees, minutes, and seconds, but we need to convert them into decimal degrees so that we can plot them.
Luckily, Barbosa et al. have produced a function that does just this, which I pasted below and then applied to our coordinate data!
artist_map$long.dec <- dms2dec(artist_map$long)
artist_map$lat.dec <- dms2dec(artist_map$lat)
# Thanks to Barbosa et al. for this very useful function! Credit: https://www.r-bloggers.com/2022/02/degree-minute-second-to-decimal-coordinates/
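For readers curious about what such a conversion involves, here is a rough base R sketch of the arithmetic. It is not the dms2dec function credited above: `dms_to_decimal` is a hypothetical helper, and it assumes each string lists degrees, then optional minutes and seconds, and ends with a hemisphere letter.

```r
# Hypothetical helper, NOT the dms2dec function credited above
dms_to_decimal <- function(x) {
  # Extract every number in order: degrees, then minutes and seconds if present
  parts <- regmatches(x, gregexpr("[0-9]+\\.?[0-9]*", x))
  vapply(seq_along(x), function(i) {
    nums <- c(as.numeric(parts[[i]]), 0, 0)[1:3]  # pad missing minutes/seconds with 0
    dec  <- nums[1] + nums[2] / 60 + nums[3] / 3600
    # South and West are negative by convention
    if (grepl("[SW]\\s*$", x[i])) -dec else dec
  }, numeric(1))
}

# Unicode escapes stand for the degree, prime, and double-prime signs
coords <- dms_to_decimal(c(
  "34\u00b003\u2032N",             # 34 degrees 3 minutes North
  "118\u00b015\u2032W",            # 118 degrees 15 minutes West
  "64\u00b008\u203248\u2033N"      # 64 degrees 8 minutes 48 seconds North
))
coords
```

Each minute is 1/60 of a degree and each second is 1/3600 of a degree, so 34 degrees 3 minutes North becomes 34 + 3/60 = 34.05.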
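Now let’s try to plot our data! The first time I tried it, I created this, which is not helpful, but is sort of strangely beautiful imo. Without a group aesthetic, geom_polygon() connects every border point in the world data into one shape, and because the polygon layer comes after the points, it is drawn over them.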
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
world <- map_data("world")
ggplot(artist_map, aes(x = long.dec, y = lat.dec)) +
  geom_point() +
  geom_polygon(data = world, aes(x = long, y = lat))
Now, let’s do it for real. It worked! I’ve plotted it out so that artists with more plays got larger points on the map. According to this map, it looks like most of my top artists started out in the US, with a particular stronghold in southern California and up the east coast. However, there is a reasonable amount of international content here, including some representation from Iceland, Armenia, and New Zealand.
ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group), fill = "grey90") +
  geom_point(data = artist_map, aes(x = long.dec, y = (lat.dec), size = n, label = artistName), color = "skyblue3", alpha = .7) +
  theme_void() +
  theme(legend.position = "none")
## Warning in geom_point(data = artist_map, aes(x = long.dec, y = (lat.dec), :
## Ignoring unknown aesthetics: label
The last thing I wanted to do is make this plot interactive, so that one can zoom in and mouse over particular bubbles to see which artist each one represents. Exploring the interactive version revealed that my scraping and joining method wasn’t perfect, because some of the artists that should be in here are missing. But at this stage I am satisfied and ready to move on to my next portfolio project. :)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
music_map <-
  ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group), fill = "grey90") +
  geom_point(data = artist_map, aes(x = long.dec, y = (lat.dec), size = n, label = artistName), color = "skyblue3", alpha = .7) +
  theme_void() +
  theme(legend.position = "none")
## Warning in geom_point(data = artist_map, aes(x = long.dec, y = (lat.dec), :
## Ignoring unknown aesthetics: label
ggplotly(music_map, tooltip = "label")